class: center, middle, inverse, title-slide

# rtweet-workshop

## Collecting and analyzing Twitter data

### Michael W. Kearney📊

School of Journalism<br />Informatics Institute<br />University of Missouri

### @kearneymw @mkearney
---

## Slides

Build these slides on your computer:

```r
source(
  "https://raw.githubusercontent.com/mkearney/rtweet-workshop/master/R/build-slides.R"
)
```

---
background-image: url(img/logo.png)
background-size: 350px auto
background-position: 50% 20%
class: center, bottom

View these slides at [mkearney.github.io/rtweet-workshop](https://mkearney.github.io/rtweet-workshop)

---
class: tight

## About {rtweet}

- On the Comprehensive R Archive Network (CRAN)
- Growing base of users
- Fairly stable
- Package website: [rtweet.info](http://rtweet.info)
- Github repo: [mkearney/rtweet](https://github.com/mkearney/rtweet)

---

## Install

- Install **{rtweet}** from [CRAN](https://cran.r-project.org/package=rtweet).

```r
install.packages("rtweet")
```

- Or install the **development version** from [Github](https://github.com/mkearney/rtweet).

```r
devtools::install_github("mkearney/rtweet")
```

- Load **{rtweet}**

```r
library(rtweet)
```

---
class: inverse, center, middle

# Accessing web APIs

---

## Some background

**Application Programming Interfaces** (APIs) are sets of protocols that govern interactions between sites and users.

APIs are similar to web browsers but serve a different purpose:

- Web browsers **render** content
- Web APIs manage and organize **data**

Many sites allow only **authorized** users to access their public APIs:

- Twitter, Facebook, Instagram, Github, etc.

---

## developer.twitter.com

To create a token with write and DM-read access, users must...

1. Apply and get approved for a developer account with Twitter

1. Create a Twitter app (fill out a form)

For step-by-step instructions on how to create a Twitter app and corresponding token, see **[rtweet.info/articles/auth.html](https://rtweet.info/articles/auth.html)**

---
class: inverse, center, middle

# Twitter Data!

---
class: inverse, center, middle

# 1.
<br /> Getting friends/followers

---

## Friends/followers

Twitter's API documentation distinguishes between **friends** and **followers**.

+ **Friend** refers to an account a given user follows

+ **Follower** refers to an account following a given user

---

## `get_friends()`

Get user IDs of accounts **followed by** (AKA friends of) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.

```r
fds <- get_friends("jack")
fds
```

---

## `get_friends()`

Get friends of **multiple** users in a single call.

```r
fds <- get_friends(
  c("hadleywickham", "NateSilver538", "Nate_Cohn")
)
fds
```

---

## `get_followers()`

Get user IDs of accounts **following** (AKA followers of) [@mizzou](https://twitter.com/mizzou).

```r
mu <- get_followers("mizzou")
mu
```

---

## `get_followers()`

Unlike friends (capped by Twitter at 5,000), there is **no limit** on the number of followers.

To get user IDs of all 55(ish) million followers of @realDonaldTrump, you need two things:

1. A stable **internet** connection

1. **Time** – approximately five and a half days

---

## `get_followers()`

Get all of Donald Trump's followers.

```r
## get all of trump's followers
rdt <- get_followers(
  "realdonaldtrump",
  n = 56000000,
  retryonratelimit = TRUE
)
```

---
class: inverse, center, middle

# 2.
<br /> Searching for tweets

---

## `search_tweets()`

Search for one or more keyword(s) (note: implicit `AND` between words)

```r
rds <- search_tweets(q = "rstats data science")
rds
```

---

## `search_tweets()`

Search for an exact phrase

```r
ds <- search_tweets('"data science"')
ds
```

---

## `search_tweets()`

Search for keyword(s) and a phrase

```r
rpds <- search_tweets("rstats python \"data science\"")
rpds
```

---

## `search_tweets()`

By default, `search_tweets()` returns up to 100 of the most recent matching tweets. To return more, set `n` to a higher number (for normal tokens, the rate limit is 18,000 tweets every fifteen minutes).

```r
rstats <- search_tweets(q = "rstats", n = 10000)
rstats
```

---

## `search_tweets()`

**PRO TIP #1**: Use `bearer_token()` to increase the rate limit to 45,000 tweets per fifteen minutes.

```r
mosen <- search_tweets(
  "mccaskill OR hawley",
  n = 45000,
  token = bearer_token()
)
```

---

## `search_tweets()`

**PRO TIP #2**: Get the firehose for free by searching for tweets posted by verified **or** non-verified accounts.

```r
fff <- search_tweets(
  "filter:verified OR -filter:verified",
  n = 45000,
  token = bearer_token()
)
ts_plot(fff, "secs")
```

---

<p style="text-align:center">
<img src="img/fff.png" />
</p>

---

## `search_tweets()`

**PRO TIP #3**: Use search operators provided by Twitter, e.g., filter by language and exclude retweets

```r
rt <- search_tweets("rstats", lang = "en", include_rts = FALSE)
```

---

## `search_tweets()`

Search by geolocation (e.g., tweets within 25 miles of Columbia, MO)

```r
como <- search_tweets(geocode = "38.9517,-92.3341,25mi", n = 1000)
como <- lat_lng(como)
par(mar = c(0, 0, 0, 0))
maps::map("state", fill = TRUE, col = "#ffffff", lwd = .25,
  mar = c(0, 0, 0, 0), xlim = c(-96, -89), ylim = c(35, 41))
with(como, points(lng, lat, pch = 20, col = "red"))
```

---

<p style="text-align:center">
<img src="img/como.png" />
</p>

---

## `search_tweets()`

Filter by the type of device that posted the tweet.
```r
rt <- search_tweets("lang:en", n = 300, source = '"Twitter for iPhone"')
```

---
class: inverse, center, middle

# 3.
<br /> User timelines

---

## `get_timeline()`

Get the most recent tweets posted by a user.

```r
cnn <- get_timeline("cnn")
```

---

## `get_timeline()`

Get up to the most recent 3,200 tweets (endpoint max) posted by multiple users.

```r
nws <- get_timeline(c("cnn", "foxnews", "msnbc"), n = 3200)
```

---

## `ts_plot()`

Group by `screen_name` and plot hourly frequencies of tweets.

```r
nws %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("hours")
```

---
class: inverse, center, middle

# 4.
<br /> User favorites

---

## `get_favorites()`

Get up to the most recent 3,000 tweets favorited by a user.

```r
kmw_favs <- get_favorites("kearneymw", n = 3000)
```

---

## `lookup_tweets()`

```r
## look up tweets by status ID
status_ids <- c("947235015343202304", "947592785519173637",
  "948359545767841792", "832945737625387008")
twt <- lookup_tweets(status_ids)
```

---

## Users

```r
## `lookup_users()`
## look up users by screen name
users <- c("hadleywickham", "NateSilver538", "Nate_Cohn")
usr <- lookup_users(users)
```

---

## `stream_tweets()`

```r
## "random" sample of all tweets
st <- stream_tweets(q = "", timeout = 30)

## filter by keyword
st <- stream_tweets(q = "realDonaldTrump,Mueller", timeout = 30)
st
ts_plot(st, "secs")
```

---

## `stream_tweets()`

```r
## locate by bounding box
st <- stream_tweets(q = lookup_coords("world"), timeout = 30)
stl <- lat_lng(st)
maps::map("world")
points(stl$lng, stl$lat, cex = 1, pch = 21,
  col = "#550000cc", bg = "#dd3333cc")
table(stl$country)
```

---

## `search_tweets()`

```r
tweet_source_data <- search_tweets(
  '(filter:verified OR -filter:verified) AND (source:"Twitter for iPhone" OR source:"Twitter for Android")',
  include_rts = FALSE,
  token = bearer_token(),
  n = 40000
)
table(tweet_source_data$source)
```

---

## Sentiment

```r
## estimate sentiment of each tweet with {syuzhet}
tweet_source_data <- tweet_source_data %>%
  dplyr::mutate(sent = syuzhet::get_sentiment(text))

## or get full NRC emotion scores (slow; first 50 tweets only)
syuzhet::get_nrc_sentiment(tweet_source_data$text[1:50])
```

---

## Features

```r
library(tidyverse)
library(gbm)

## extract numeric text features for each tweet
tf <- textfeatures::textfeatures(tweet_source_data)

## outcome: was the tweet posted from an iPhone?
table(tweet_source_data$source)
tf$y <- tweet_source_data$source == "Twitter for iPhone"
```

---

## Machine learning data

```r
## sample tweets from verified and non-verified accounts
v <- search_tweets('filter:verified', n = 300)
nv <- search_tweets('-filter:verified', n = 300)
v <- dplyr::bind_rows(v, nv)
```

---

## Machine learning

```r
## fit a gradient boosting model on the first 15,000 rows
m1 <- gbm(y ~ ., data = tf[1:15000, -1], n.trees = 200)

## predict on the held-out rows
p <- predict(m1, newdata = tf[15001:nrow(tf), -1],
  type = "response", n.trees = 200)

## confusion matrix and overall accuracy
table(p > .50, tf$y[15001:nrow(tf)])
(2769 + 1245) / (756 + 1530 + 2769 + 1245)

## variable importance
summary(m1)
```

---

## Group/summarise

```r
tweet_source_data %>%
  group_by(source) %>%
  summarise(
    n = n(),
    users = n_distinct(user_id),
    chars = mean(nchar(text)),
    sent = mean(sent, na.rm = TRUE)
  )
```

---

## List members

```r
lists_members()

bp <- tweetbotornot::tweetbotornot(c(
  "kearneymw", "realdonaldtrump", "netflix_bot",
  "tidyversetweets", "thebotlebowski", "rodhart99"))
bp
```

---

## Tweetbotornot

```r
remotes::install_github("mkearney/tweetbotornot")
remotes::install_github("mkearney/textfeatures")
## install.packages("quanteda")
```
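
---

## Appendix: creating a token

The auth slide earlier points to [rtweet.info/articles/auth.html](https://rtweet.info/articles/auth.html) for full instructions; a minimal sketch of the last step with `create_token()` looks like this. All key/secret values and the app name below are placeholders — copy the real ones from your app's "Keys and tokens" page on developer.twitter.com.

```r
library(rtweet)

## create (and cache for future sessions) an OAuth token for your app
## NOTE: every value below is a placeholder, not a working credential
token <- create_token(
  app             = "my_rtweet_app",
  consumer_key    = "XXXXXXXXXXXXXXXXX",
  consumer_secret = "XXXXXXXXXXXXXXXXX",
  access_token    = "XXXXXXXXXXXXXXXXX",
  access_secret   = "XXXXXXXXXXXXXXXXX"
)

## confirm rtweet can find the saved token
get_token()
```

Run once per machine; `create_token()` saves the token so later calls like `search_tweets()` find it automatically.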